1. Introduction

In 1867, at the dawn of statistical physics, Maxwell imagined a thought experiment 
that has both troubled and inspired physicists ever since [1]. In modern language, the
issue is that traditional thermodynamics posits a strict separation between observable 
macroscopic motion (dynamical systems) and unobservable degrees of freedom (heat). 
But imagine—as can now be done experimentally on small systems where fluctuations 
are important—that it is possible to observe some of these hidden degrees of freedom. 
(Maxwell’s thought experiment used a “demon” to accomplish the same task.) In 
any case, the entropy of the system is reduced, and one can use the lower entropy 
to extract work from the surrounding heat bath, in seeming violation of the Second Law 
of thermodynamics. 

This blurring of macroscopic and microscopic degrees of freedom has led to a 
new field, stochastic thermodynamics, which clarifies how thermodynamics should be
applied to small systems where fluctuations are observable and important [2]. As we
will see below, the nature of information acquired about the fluctuations—especially the 
precision with which they are measured and the time they become available—is of great 
importance. Indeed, information is itself a thermodynamic resource, and stochastic 
thermodynamics can be extended to accommodate the acquisition, dissipation, flow, 
and feedback of information [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]. For a recent
review, see [16].

The goal of the present contribution is to combine ideas from control theory (state 
estimation) [17, 18] with ideas from computer science about hidden Markov models
[19, 20, 21, 22, 23] in order to explain some recent surprising observations from stochastic
thermodynamics about how Maxwell’s demon operates in the presence of measurement 
errors [24]. As a bonus, the formalism we discuss suggests a number of interesting areas 
where the stochastic thermodynamics of information may be extended. 


2. Coarse graining and discrete state spaces 

In the simplest non-trivial example of a discrete state space, a state $x$ can, at each
discrete time point, take on one of two values, for example $-1$ and $+1$. While systems
such as spin-1/2 particles are inherently discrete, a broad range of physical systems,
even classical ones with continuous state spaces, can often be well approximated by discrete
systems after coarse graining. Figure 1(a) sketches such a system, a protein in solution
that alternates between a loose unfolded ($-1$) and a compact folded ($+1$) state. Other
biological examples of two-state systems include ion channels that can be open or closed, 
gene-transcription repressor sites that can be occupied or empty, and sensory receptors 
that can be active or silent (chapter 7 in [25]). 



Figure 1. Coarse graining to find a Markov model. (a) A protein in water alternates
between two conformations. (b) A one-dimensional projection of the dynamics. White
vertical line denotes threshold separating the $\pm 1$ states. (c) Graphical depiction of a
symmetric two-state Markov chain.


Figure 1 illustrates schematically how to coarse grain from a physical situation, such
as a protein in water, to a discrete-time Markov model. In (a), we depict two states
of the protein, labeled “unfolded” and “folded” or, equivalently, $-1$ and $+1$. The word
“state” is here a shorthand for “macrostate” and is associated with many microstates,
each of which corresponds to a slightly different protein conformation that preserves the
general property in question. In (b), we project the full dynamics onto a one-dimensional
subspace modeled by a double-well potential. States with $x < 0$ are classified as $-1$,
and states with $x > 0$ are classified as $+1$. The symmetry of the potential implies
that the protein spends equal time in the two states, which is a special situation. In
(c), we show a graphical depiction of the discrete, two-state Markov chain dynamics,
where in a time $\tau$, states remain the same with probability $1 - a$ and hop to the other
with probability $a$. In order for a two-state description to reasonably approximate the
dynamics, the dwell time spent in each well must be much longer than the time scale
for fast motion within a well. This holds when a single energy barrier, whose height is
much larger than $kT$, separates the two states.
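
As a rough illustration of the coarse-graining step, the sketch below thresholds a sampled coordinate at $x = 0$ and counts hops between the resulting $\pm 1$ states. The sample values and the function names `coarse_grain` and `estimate_hop_probability` are hypothetical, chosen only for this example; assigning $x = 0$ to the $+1$ state is also an arbitrary convention, since the text leaves that case open.

```python
def coarse_grain(trajectory, threshold=0.0):
    """Map each continuous coordinate to a discrete state.

    Values below the threshold become -1; all others (including the
    threshold itself, by an arbitrary convention) become +1.
    """
    return [-1 if x < threshold else +1 for x in trajectory]


def estimate_hop_probability(states):
    """Fraction of consecutive time steps on which the discrete state changes."""
    hops = sum(1 for s, s_next in zip(states, states[1:]) if s != s_next)
    return hops / (len(states) - 1)


# Hypothetical samples of the projected coordinate x, one per time step tau.
x_samples = [-1.1, -0.9, -1.0, 0.8, 1.2, 1.0, 0.9, -1.05, -0.95, -1.0, 1.1]
states = coarse_grain(x_samples)
a_hat = estimate_hop_probability(states)   # crude estimate of the hop probability a
print(states)
print(a_hat)
```

With real data, such an estimate of $a$ is meaningful only when the sampling time is long compared with the fast motion within a well, as required above.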

Why might we want to approximate physical systems by discrete state spaces? 

• Clarity: We can isolate just the important degrees of freedom, letting the others
be uncontrolled and even unobserved. 

• Simplicity: The mathematical description is more straightforward. 

• Generality: Any dynamics that can be modeled on a computer is necessarily 
discretized in both time and state. 

3. Markov chains 

Let us briefly recall the basics of discrete-state-space systems in discrete time. Consider
a system described at time $k$ by a state $x_k$ that can be in one of $n$ possible states,
indexed by the values 1 to $n$. The index is distinguished from its value, which, for a
two-state system, might be $\{\pm 1\}$, $\{0, 1\}$, or even {left, right}. Let $P(x_k = i)$ be the
probability that, at time $k$, the system is in the state indexed by $i$. The distribution is
normalized by enforcing $\sum_{i=1}^{n} P(x_k = i) = 1$ or, more succinctly, $\sum_{x_k} P(x_k) = 1$. For
dynamics, we consider Markov chains, which are systems with discrete time and discrete
states. The Markov property implies that the next state depends only on the current
state, as illustrated graphically in figure 2, which may be compared with figure 1(c).



Figure 2. Markov model graphical structure. The state $x_{k+1}$ depends only on $x_k$.


For Markov chains, the dynamics are specified in terms of an $n \times n$ transition matrix
$A$ whose elements $A_{ij} = P(x_{k+1} = i \mid x_k = j)$ satisfy $0 \le A_{ij} \le 1$. That is, $A_{ij}$ gives the
probability of a $j \to i$ transition in one time step. For example, a general two-state system has

$$ A = \begin{pmatrix} 1 - a_0 & a_1 \\ a_0 & 1 - a_1 \end{pmatrix} . \tag{1} $$

Notice that the columns of $A$ sum to 1, as required by the normalization of probability
distributions. In words, if you start in state $j$ then you must end up in one of the $n$
possible states, indexed by $i$. Figure 1(c) depicts (1) graphically, with $a_0 = a_1 = a$. A
matrix with elements $0 \le A_{ij} \le 1$ and $\sum_i A_{ij} = 1$ is a (left) stochastic matrix.

Define the $n$-dimensional stochastic vector $\mathbf{p}_k$, whose elements $p_k^{(j)} = P(x_k = j)$
give the probability to be in state $j$ at time $k$. Then $0 \le p_k^{(j)} \le 1$ and $\sum_j p_k^{(j)} = 1$ and

$$ p_{k+1}^{(i)} = \sum_{j=1}^{n} P(x_{k+1} = i, x_k = j) = \sum_{j=1}^{n} P(x_{k+1} = i \mid x_k = j) \, P(x_k = j) = \sum_{j=1}^{n} A_{ij} \, p_k^{(j)} . \tag{2} $$

More compactly, $\mathbf{p}_{k+1} = A \mathbf{p}_k$, a linear difference equation known as the discrete-time
master equation, with solution $\mathbf{p}_k = A^k \mathbf{p}_0$. Often, we seek the steady-state
distribution, defined by $\mathbf{p}^* = A \mathbf{p}^*$. One way to find $\mathbf{p}^*$ is to repeatedly iterate (2);
another is to note that the steady-state distribution of probabilities corresponds to the
eigenvector associated with an eigenvalue equal to 1. A stochastic matrix must have
such an eigenvalue, since $A - \mathbb{I}$ is a matrix whose columns all sum to zero. Its rows are
then linearly dependent, so that its determinant is zero.

For example, the two-state Markov model with transition matrix $A$ given by (1)
has eigenvalues $\lambda = 1$ and $1 - (a_0 + a_1)$. The normalized eigenvector corresponding to
$\lambda = 1$ is

$$ \mathbf{p}^* = \frac{1}{a_0 + a_1} \begin{pmatrix} a_1 \\ a_0 \end{pmatrix} . \tag{3} $$

For the symmetric case, $a_0 = a_1 = a$ and $\mathbf{p}^* = (0.5, \; 0.5)^{\mathsf{T}}$, independent of $a$. By symmetry,
both states are a priori equally probable.
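
The two routes to the steady state can be compared numerically. The following minimal sketch iterates the master equation for the two-state matrix of (1) and compares the result with the eigenvector formula (3). The values $a_0 = 0.1$ and $a_1 = 0.3$ and the helper names are illustrative assumptions, not taken from the text.

```python
def mat_vec(A, p):
    """Multiply a square matrix (given as a list of rows) by a column vector."""
    return [sum(a_ij * p_j for a_ij, p_j in zip(row, p)) for row in A]


def two_state_matrix(a0, a1):
    """Transition matrix of equation (1); A[i][j] = P(next = i | current = j)."""
    return [[1.0 - a0, a1],
            [a0, 1.0 - a1]]


def iterate_master_equation(A, p0, n_steps):
    """Apply p_{k+1} = A p_k for n_steps steps, starting from p0."""
    p = list(p0)
    for _ in range(n_steps):
        p = mat_vec(A, p)
    return p


a0, a1 = 0.1, 0.3                         # illustrative values only
A = two_state_matrix(a0, a1)
column_sums = [A[0][j] + A[1][j] for j in range(2)]      # each should equal 1
p_iterated = iterate_master_equation(A, [1.0, 0.0], 200)
p_formula = [a1 / (a0 + a1), a0 / (a0 + a1)]             # equation (3)
print(column_sums, p_iterated, p_formula)
```

The iteration converges geometrically, at a rate set by the second eigenvalue $1 - (a_0 + a_1)$, so a few hundred steps are ample for these values.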


4. Hidden Markov models 

Often, the states of a Markov chain are not directly observable; however, there may 
be measurements (or emitted symbols ) that correlate with the underlying states. The 
combination is known as a hidden Markov model (HMM). The hidden states are also 
sometimes known as latent variables [22]. The observations are assumed to have no 
memory: what is measured depends only on the current state, and nothing else. The 
graphical structure of an HMM is illustrated in figure 3.

Figure 3. HMM graphical structure. The states $x_k$ form a Markov process that is
not directly observable. The observations $y_k$ depend only on $x_k$.

In the example of proteins that alternate between unfolded and folded states, the
molecule itself is not directly observable. One way to observe the configuration is to
attach a particle to one end of the protein and anchor the other end to a surface [26],
as illustrated in figure 4(a). As the protein folds and unfolds, the particle moves up and
down from the surface. We can illuminate the region near the surface using an evanescent
wave via the technique known as total internal reflection microscopy. The intensity $I(z)$
of light scattered by the bead at height $z$ from the surface will decrease exponentially as
$I(z) \propto e^{-z/z_0}$, with $z_0 \approx 100$ nm. The two states will then correspond to two different
scattering intensities. The observation $y_k$ is the number of recorded photons, integrated
over a time that is shorter than the dwell time in each local potential well.



Figure 4. Markov vs. hidden Markov models. (a) Schematic illustration of a
scattering probe of protein conformation where evanescent-wave illumination changes
the intensity of scattered light in the two states. (b) Observations for a two-state
Markov process where observations correlate unambiguously with states (top) and a
hidden Markov process (bottom) where conditional distributions overlap. True state
in light gray. Observations $y_k$ are indicated by round markers and have Gaussian
noise, with standard deviation $\sigma = 0.2$ (top) and 0.6 (bottom). Histograms of $y_k$ are
compiled from $10^4$ observations, with 100 shown.


As with states, we can further simplify by discretizing the intensities, classifying as 
“dim” intensities below a given threshold and “bright” intensities above that threshold. 
“Dim” and “bright” then become two observation symbols. Because light scattering 
is itself a stochastic process, the protein can be in one state but emit the “wrong” 
symbol, as illustrated in figure 4(b). We can describe such a situation by defining the
observations $y_k = \pm 1$ and noting that they are related to the states probabilistically via
an observation matrix $B$ having components $B_{ij} = P(y_k = i \mid x_k = j)$,

$$ B = \begin{pmatrix} 1 - b & b \\ b & 1 - b \end{pmatrix} , \tag{4} $$

where we suppose, for simplicity, that errors are symmetric. Because observations have
no memory, the probability to observe $y_k$ depends only on the current state $x_k$.

In words, the matrix $B$ states that an observation is correct with probability $1 - b$
and wrong with probability $b$. Like the transition matrix $A$, the matrix $B$ is stochastic,
with columns that sum to 1. Its rows also sum to 1, but only because of the symmetry
between states. Note that the number of observation symbols, $m$, need not equal the
number of internal states, $n$. The $m \times n$ matrix $B$ can have $m$ bigger or smaller than
$n$. The case of continuous observations ($m \to \infty$) is also straightforward. Larger values
of $m$ increase knowledge of the underlying state somewhat.

One interesting feature of HMMs is that states $x_k$ follow a Markov process and so
does the combined process for $x_k$ and $y_k$, but not necessarily the observations $y_k$. The
analysis of HMMs is thus more difficult than for ordinary Markov processes.
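
To make the model concrete, here is a minimal simulation sketch of the symmetric two-state, two-symbol HMM: the hidden state flips with probability $a$ at each step, and each emitted symbol is wrong with probability $b$. The function name `simulate_hmm`, the seed, and the sequence length are assumptions made for illustration. The parameter values $a = 0.2$ and $b = 0.3$ match those quoted for the filtering example.

```python
import random


def simulate_hmm(n_steps, a, b, seed=0):
    """Simulate the symmetric two-state, two-symbol HMM.

    States and observations take the values -1 and +1. At each step the
    state flips with probability a, and the emitted symbol differs from
    the state with probability b.
    """
    rng = random.Random(seed)
    x = rng.choice([-1, 1])          # symmetric chain: both states equally likely
    states, observations = [], []
    for _ in range(n_steps):
        if rng.random() < a:         # hidden Markov dynamics
            x = -x
        y = -x if rng.random() < b else x   # memoryless, noisy observation
        states.append(x)
        observations.append(y)
    return states, observations


states, observations = simulate_hmm(20000, a=0.2, b=0.3)
error_fraction = sum(x != y for x, y in zip(states, observations)) / len(states)
flip_fraction = sum(s != t for s, t in zip(states, states[1:])) / (len(states) - 1)
print(error_fraction, flip_fraction)
```

For a long run, the fraction of wrong symbols should approach $b$ and the fraction of state flips should approach $a$, up to sampling fluctuations.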

The literature on HMMs is both vast and dispersed. For treatments of increasing 
complexity, see section 16.3 of Numerical Recipes [19], the bioinformatics book by
Durbin et al. [21], a classic tutorial from the speech-recognition literature [20], the
control-influenced book by Särkkä [27], and the mathematical treatment of Cappé et
al. [23]. The tutorial by Rabiner [20] has been particularly influential; however, its
notation and ways of deriving results are more complicated than need be, and some 
of its methods have been replaced by better algorithms. The discussion here is based 
largely on the cleaner derivations in [27]. 

5. State estimation 

Hidden Markov models are specified by a transition matrix A and observation matrix B. 
Let us pose the following problem: Given the output of a hidden Markov model (HMM), 
what can be inferred about the states? The answer depends both on the information 
available and the exact quantity desired. Here, we focus on two cases: 

(i) Filtering, or $P(x_k \mid y^k)$. We estimate the probabilities for each state based on
observations $y^k = \{y_1, y_2, \ldots, y_k\}$ up to and including the present time $k$. Filtering
is appropriate for real-time applications such as control [?].

(ii) Smoothing, or $P(x_k \mid y^N)$, for $N > k$. Smoothing uses data from the future as well
as the past in the offline post-processing of $N$ observations.

Another quantity of interest is the most likely path, defined as $\arg\max_{x^N} P(x^N \mid y^N)$,
which may be found by an algorithm due to Viterbi [?]. For example, McKinney et
al. study transitions between different configurations of a DNA Holliday junction, using
fluorescence resonance energy transfer (FRET) to read out the states, and infer the
most likely state sequence [28]. Since path estimates are less useful for feedback control,
we will consider them only in passing, in section 8. We will also see that smoothing
estimates provide a useful contrast with filter estimates.

† An alternate notation for $P(x_k \mid y^k)$ is $P(x_k \mid y_{1:k})$. Our notation seems cleaner and easier to read.

5.1. Filtering

The filtering problem is to find the probability distribution of the state $x_k$ based on the
past and current observations $y^k$ from time 1 to time $k$. We assume that the dynamics
have been coarse grained to be Markov, so that the state $x_{k+1}$ depends only on the state
$x_k$. Then $P(x_{k+1} \mid x_k, \cancel{y^k}) = P(x_{k+1} \mid x_k)$, where the “cancel” slash indicates conditional
independence: conditioning on $x_k$ “blocks” the influence of all other variables. The $x^{k-1}$
are blocked, too: the state at time $k + 1$ depends only on the state at time $k$.

From marginalization and the definition of conditional probability, we have

$$ \begin{aligned}
P(x_{k+1} \mid y^k) &= \sum_{x_k} P(x_{k+1}, x_k \mid y^k) \\
&= \sum_{x_k} P(x_{k+1} \mid x_k, \cancel{y^k}) \, P(x_k \mid y^k) \\
&= \sum_{x_k} P(x_{k+1} \mid x_k) \, P(x_k \mid y^k) .
\end{aligned} \tag{5} $$

Equation (5) predicts the state $x_{k+1}$ on the basis of $y^k$, assuming that the previous filter
estimate, $P(x_k \mid y^k)$, is already known. Once the new observation $y_{k+1}$ is available, we
can use Bayes’ Theorem and the memoryless property of observations,
$P(y_k \mid x_k, \cancel{y^{k-1}}) = P(y_k \mid x_k)$, to update the prediction (5) to incorporate the new observation. Then,

$$ P(x_{k+1} \mid y^{k+1}) = \frac{1}{Z_{k+1}} \, P(y_{k+1} \mid x_{k+1}, \cancel{y^k}) \, P(x_{k+1} \mid y^k) , \tag{6} $$

where $Z_{k+1}$ normalizes the distribution. Equations (5)-(6) constitute the Bayesian
filtering equations [?]. Because of their importance, we collect them here:


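$$ \begin{aligned}
P(x_{k+1} \mid y^k) &= \sum_{x_k} P(x_{k+1} \mid x_k) \, P(x_k \mid y^k) && \text{predict} \\
P(x_{k+1} \mid y^{k+1}) &= \frac{1}{Z_{k+1}} \, P(y_{k+1} \mid x_{k+1}) \, P(x_{k+1} \mid y^k) && \text{update.}
\end{aligned} \tag{7} $$

The normalization (partition function) $Z_{k+1}$ is given by

$$ Z_{k+1} = P(y_{k+1} \mid y^k) = \sum_{x_{k+1}} P(y_{k+1} \mid x_{k+1}) \, P(x_{k+1} \mid y^k) . \tag{8} $$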

Note that the HMM literature, e.g., [19] and [23], expresses (7) differently,
using joint probabilities such as $P(x_k, y^k)$ rather than conditional probabilities such
as $P(x_k \mid y^k)$. Using joint probabilities leads to the forward algorithm. Our notation
emphasizes the similarities between HMM and state-space models of dynamics; the
formulas of one apply mostly to the other, with $\sum_{x_k} \leftrightarrow \int dx_k$. For continuous
state spaces with linear dynamics and Gaussian noise, (7) is equivalent to the Kalman
filter [?]. Below, we will see that using conditional probabilities also has numerical
advantages.

Figure 5. Filtering for a symmetric, two-state, two-symbol hidden Markov model
with $a = 0.2$ and $b = 0.3$. Light gray line shows true state, which is hidden. Markers
show 100 observations. Heavy black line shows the probability that the state equals
$+1$, given by $P(x_k = 1 \mid y^k)$. The maximum confidence level $p^* \approx 0.85$ (dashed line).

Figure 5 shows filtering in action for a symmetric, two-state, two-symbol hidden
Markov model. The time series of observations $y_k$ (markers) disagrees with the true
state 30% of the time. The black line shows $P(x_k = 1 \mid y^k)$. When that probability
is below the dashed line at 0.5, the most likely state is $-1$. For the value of $a$ used in
the dynamic matrix ($a = 0.2$), the filter estimate $\hat{x}_k = \arg\max_{x_k} P(x_k \mid y^k)$ disagrees
with the observation only 7% of the time, a noticeable improvement over the naive
30%. Notice that whenever the state changes, the filter probability responds, with a
time constant set by both observational noise ($b$) and dynamics ($a$). A long string of
identical observations causes filter confidence to saturate at $p^*$ (dashed line).
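
The predict and update steps of (7) reduce to a few lines for the symmetric two-state, two-symbol model. The sketch below is one possible implementation, with hypothetical names `filter_step` and `run_filter`. It tracks the single number $P(x_k = +1 \mid y^k)$ and feeds the filter a long string of identical observations to exhibit the saturation level $p^*$. The starting value 0.5 is an assumption corresponding to the stationary distribution of the symmetric chain.

```python
def filter_step(p_plus, y, a, b):
    """One predict-update cycle for the symmetric two-state, two-symbol HMM.

    p_plus is P(x_k = +1 | y^k); y is the new observation y_{k+1} (-1 or +1).
    Returns P(x_{k+1} = +1 | y^{k+1}) and the normalization Z_{k+1}.
    """
    # Predict: P(x_{k+1} = +1 | y^k) = (1 - a) p + a (1 - p)
    q_plus = (1.0 - a) * p_plus + a * (1.0 - p_plus)
    # Update: weight each state by the probability that it emits y
    like_plus = 1.0 - b if y == 1 else b
    like_minus = b if y == 1 else 1.0 - b
    z = like_plus * q_plus + like_minus * (1.0 - q_plus)
    return like_plus * q_plus / z, z


def run_filter(observations, a, b, p_init=0.5):
    """Filter a sequence of observations.

    p_init is the probability of the +1 state one step before the first
    observation; for the stationary value 0.5 the first prediction leaves
    it unchanged.
    """
    p = p_init
    history = []
    for y in observations:
        p, _ = filter_step(p, y, a, b)
        history.append(p)
    return history


a, b = 0.2, 0.3
history = run_filter([1] * 50, a, b)      # a long string of identical observations
p_star = history[-1]                      # saturation level of the filter
p_after_one_contrary, _ = filter_step(p_star, -1, a, b)
print(p_star, p_after_one_contrary)
```

For these parameters the saturation value agrees with the $p^* \approx 0.85$ quoted in the caption of figure 5, and a single contrary observation at saturation leaves the $+1$ probability just above 0.5.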

There is an advantage to recording the probability estimates (black line) rather than 
simply the MAP (maximum a posteriori) estimate, which here is just the more likely 
of the two possibilities. When the filter is wrong, the two probabilities are often not 
that different. An example is indicated by the arrow in figure 5. Thus, marginalizing
(averaging) any prediction over all possibilities rather than just the most likely will
improve estimates. Of course, a string of wrong symbols can fool the filter. See, in
figure 5, the three wrong symbols just to the left of the arrow.

Below, we will see that the filtered estimate becomes significantly more reliable as
$a \to 0$. Intuitively, small $a$ means that states have a long dwell time, so that averaging
observations over times of the order of the dwell time can reduce the effect of the 
observational noise, which is quantified by the parameter b. 

5.2. Smoothing 

If we estimate the state $x_k$ after gathering $N$ observations ($N > k$), we can use
the “future” information to improve upon the filter estimate. In the control-theory
literature, such estimates are called “smoother” estimates, as they further reduce the
consequences of observation noise.

The smoother estimate has two stages. First, we use the filter algorithm (7) to
calculate $P(x_k \mid y^k)$ and $P(x_{k+1} \mid y^k)$ for each $k \in [1, N]$. Then we calculate $P(x_k \mid y^N)$ via
a backward recursion relation from the final time $N$ to the initial time 1:

$$ P(x_k \mid y^N) = P(x_k \mid y^k) \sum_{x_{k+1}} \frac{P(x_{k+1} \mid x_k) \, P(x_{k+1} \mid y^N)}{P(x_{k+1} \mid y^k)} . \tag{9} $$

The backwards recursion relation is initialized by $P(x_N \mid y^N)$, the last step of the forward
filter recursion.

To derive (9), we introduce the state $x_{k+1}$, which we will remove later by
marginalization [57]. Thus,


$$ P(x_k, x_{k+1} \mid y^N) = P(x_k \mid x_{k+1}, y^N) \, P(x_{k+1} \mid y^N) . \tag{10} $$

But

$$ P(x_k \mid x_{k+1}, y^N) = P(x_k \mid x_{k+1}, y^k) = \frac{P(x_k, x_{k+1} \mid y^k)}{P(x_{k+1} \mid y^k)} = \frac{P(x_{k+1} \mid x_k, \cancel{y^k}) \, P(x_k \mid y^k)}{P(x_{k+1} \mid y^k)} , \tag{11} $$

using conditional probability and the Markov property. Substituting into (10),

$$ P(x_k, x_{k+1} \mid y^N) = \frac{P(x_k \mid y^k) \, P(x_{k+1} \mid x_k) \, P(x_{k+1} \mid y^N)}{P(x_{k+1} \mid y^k)} . \tag{12} $$

Summing both sides over $x_{k+1}$ gives (9).

The algorithm defined by (7) and (9) is equivalent to the Rauch-Tung-Striebel
smoother from control theory when applied to continuous state spaces, linear dynamics,
and white-noise inputs [27]. In the HMM literature, a close variant is the
forward-backward algorithm [50].
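
The two-stage structure (forward filter, then backward recursion) can be sketched for a general discrete HMM as follows. The function names, the index convention (states and symbols labeled 0 and 1), and the short observation sequence are assumptions made for illustration; the sequence is not the data of any figure. The code also assumes that all predicted probabilities are nonzero, so that the division in the backward recursion is well defined.

```python
def forward_filter(observations, A, B, prior):
    """Bayesian filter for a discrete HMM.

    A[i][j] = P(x_{k+1} = i | x_k = j), B[y][j] = P(y_k = y | x_k = j), and
    prior[i] is the predicted distribution of the first state. Observations
    are symbol indices. Returns two lists: the predicted distributions
    P(x_k | y^{k-1}) and the filtered distributions P(x_k | y^k).
    """
    n = len(prior)
    predicted, filtered = [], []
    pred = list(prior)
    for y in observations:
        weights = [B[y][i] * pred[i] for i in range(n)]      # update
        z = sum(weights)
        filt = [w / z for w in weights]
        predicted.append(pred)
        filtered.append(filt)
        pred = [sum(A[i][j] * filt[j] for j in range(n)) for i in range(n)]  # predict
    return predicted, filtered


def backward_smoother(predicted, filtered, A):
    """Backward recursion for P(x_k | y^N), started from the last filter estimate."""
    n = len(filtered[0])
    smoothed = [None] * len(filtered)
    smoothed[-1] = list(filtered[-1])
    for k in range(len(filtered) - 2, -1, -1):
        smoothed[k] = [
            filtered[k][j] * sum(
                A[i][j] * smoothed[k + 1][i] / predicted[k + 1][i] for i in range(n)
            )
            for j in range(n)
        ]
    return smoothed


a, b = 0.2, 0.3
A = [[1 - a, a], [a, 1 - a]]
B = [[1 - b, b], [b, 1 - b]]
observations = [0, 0, 0, 1, 0, 0, 0]      # one contrary symbol in the middle
predicted, filtered = forward_filter(observations, A, B, [0.5, 0.5])
smoothed = backward_smoother(predicted, filtered, A)
print(filtered[3][0], smoothed[3][0])
```

In this hypothetical sequence, the filter narrowly favors the wrong state at the contrary symbol, whereas the smoother, which also sees the three later observations, assigns clearly more than half the probability to state 0.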




Figure 6. Smoother estimates (black line) for two-state, two-symbol HMM with
$a = 0.2$ and $b = 0.3$. Filter estimate is shown as a light gray trace. The simulation
and filter estimate are both from figure 5.


We can apply the smoother algorithm to the example of section 5.1 and obtain
similar results. In figure 6, we plot the smoother estimate, with the filter estimate
added as a light gray trace. Despite their similarity, the differences are instructive: The
filter always lags (reacts to) observations, whereas the smoother curve is more symmetric
in time. Flipping the direction of time alters the overall form of the filter plot but not
the smoother. The smoother estimates are more confident than the filter estimates, as
they use more information. Look at the time step indicated by the arrow. The filter
estimate is just barely mistaken, but the smoother estimate makes the correct call, aided
by the three correct observations that come before and the three after.

The phase lag apparent in the filter estimate is consistent with causality. Indeed,
for continuous state spaces, the well-known Bode gain-phase relations (the “magnitude-phase”
equivalent of the Kramers-Kronig relations [30]) give the minimum phase lag
for the output of a dynamical system that is consistent with causality. The smoother
estimate in figure 6 has zero phase lag, as expected since it uses past and future
information equally. Sudden jumps are anticipated by the smoother before they happen.

Intuitively, an estimator that uses more information should perform better. We
can formalize this intuition via the notion of conditional Shannon entropy [31]. With
$p_j = P(x_k = j \mid y^k)$,

$$ H(x_k \mid y^k) = - \sum_{j=1}^{n} p_j \log p_j , \tag{13} $$

where using a base-2 logarithm gives units of bits. For large-enough $k$, the average of
$H(x_k \mid y^k)$ over $y^k$ becomes independent of $k$. Averaging over a single long time series
of observations then leads to $\langle H(x_k \mid y^k) \rangle = H(x \mid \overleftarrow{y})$, where $\overleftarrow{y}$ denotes past and present
observations. A similar definition holds for the smoother entropy, $H(x_k \mid y^N)$, and leads
to a steady-state smoother entropy $H(x \mid \overleftrightarrow{y})$, where $\overleftrightarrow{y}$ includes both past and future
observations. To characterize the performance of filtering and smoothing, we recall that
for a two-state probability distribution, the entropy ranges from 1 bit (equal probabilities 
for each possibility) to 0 bits (certainty about each possibility). 
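
As a small worked instance of (13), the sketch below evaluates the entropy in bits for a single state estimate. The function name `entropy_bits` is an assumption, and terms with $p_j = 0$ are taken to contribute zero, the usual convention.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy -sum_j p_j log2 p_j, with 0 log 0 taken as 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0.0)


h_uniform = entropy_bits([0.5, 0.5])      # equal probabilities
h_certain = entropy_bits([1.0, 0.0])      # certainty
h_example = entropy_bits([0.85, 0.15])    # an estimate held with 85% confidence
print(h_uniform, h_certain, h_example)
```

An estimate held with 85% confidence thus still carries about 0.61 bits of uncertainty. The steady-state entropies discussed in the text are averages of such single-estimate values over a long time series.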




Figure 7. Smoother outperforms filter. (a) Shannon entropies of filter and smoother
state estimates. The symmetric transition matrix $A$ has parameter $a = 0.02$. (b)
Filter minus smoother. Calculations use time series of length $10^5$. Horizontal axis in
both panels: symbol error probability $b$.


Figure 7(a) shows the steady-state filter and smoother Shannon entropies as a
function of $b$, the error rate in the observation matrix $B$. At small values of $a$, the
smoother has a greater advantage relative to the filter: when dwell times in each
state are long, the information provided by averaging is more important. Figure 7(b)
plots the difference between filter and smoother entropies. For $b = 0$, the difference
vanishes: with no noise, the observation perfectly determines the state, and there is no
uncertainty about it afterwards. For $b = 0.5$, the observations convey no information,
so that $H(x \mid \overleftarrow{y}) = H(x \mid \overleftrightarrow{y}) = H(x) = 1$ bit, and the difference is again zero. For
intermediate values of $b$, the smoother entropy is lower than the filter entropy.

6. Learning hidden Markov models 

The state-estimation procedures described above assume that the transition matrix
$A$, the emission matrix $B$, and initial probability $P(x_1)$ are known. If not, they
can be estimated from the observations $y^N$. In the context of HMMs, the task is
called, variously, parameter inference, learning, and training [19]. In the control-theory
literature on continuous state spaces, it is known as system identification [32].

The general approach is to maximize the likelihood of the unknown quantities,
grouped here into a single parameter vector $\theta$. That is, we seek

$$ \theta^* = \arg\max_{\theta} P(y^N \mid \theta) = \arg\min_{\theta} \left[ - \ln P(y^N \mid \theta) \right] , \tag{14} $$

where it is better to compute $L(\theta) = - \ln P(y^N \mid \theta)$ because $P(y^N \mid \theta)$ decreases
exponentially with $N$, leading to numerical underflow. The negative sign is a convention
from least-squares curve fitting, where $\chi^2(\theta)$ is also proportional to the negative log
likelihood of the data [19].

We can find the total likelihood $P(y^N \mid \theta)$ from the normalization condition in (7):

$$ P(y^N) = \underbrace{\prod_{k=1}^{N} P(y_k \mid y^{k-1})}_{\text{chain rule}} = \prod_{k=1}^{N} Z_k , \tag{15} $$

where $Z_1 = P(y_1)$. Then

$$ L(\theta) = - \sum_{k=1}^{N} \ln \sum_{x_k} P(y_k \mid x_k) \, P(x_k \mid y^{k-1}) , \tag{16} $$

where all right-hand-side terms depend also on $\theta$. Since $L(\theta)$ is just a function of $\theta$, we
can use standard optimization routines to find the $\theta^*$ that minimizes $L$.
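
A minimal sketch of (16) for the symmetric two-state, two-symbol model, with $\theta = (a, b)$, accumulates $-\ln Z_k$ while running the filter. The coarse grid search stands in for a proper optimization routine and is used here only to keep the example free of external libraries. The function names, the grid, and the synthetic observation sequence are all assumptions made for illustration.

```python
import math


def negative_log_likelihood(observations, a, b):
    """L(theta) = -sum_k ln Z_k for the symmetric two-state, two-symbol HMM.

    theta = (a, b); observations are -1 or +1. The first state is assumed
    to follow the stationary distribution (1/2, 1/2).
    """
    q_plus = 0.5                           # predicted P(x_k = +1 | y^{k-1})
    total = 0.0
    for y in observations:
        like_plus = 1.0 - b if y == 1 else b
        like_minus = b if y == 1 else 1.0 - b
        z = like_plus * q_plus + like_minus * (1.0 - q_plus)    # Z_k
        total -= math.log(z)
        p_plus = like_plus * q_plus / z                         # update
        q_plus = (1.0 - a) * p_plus + a * (1.0 - p_plus)        # predict
    return total


def grid_search(observations, grid):
    """Return the (a, b) pair on the grid with the smallest L (a crude optimizer)."""
    return min(((a, b) for a in grid for b in grid),
               key=lambda theta: negative_log_likelihood(observations, *theta))


# Synthetic, highly persistent sequence: runs of ten identical symbols.
observations = ([1] * 10 + [-1] * 10) * 2
n_obs = len(observations)
L_uninformative = negative_log_likelihood(observations, a=0.2, b=0.5)
L_persistent = negative_log_likelihood(observations, a=0.1, b=0.1)
grid = [0.05 * i for i in range(1, 10)]    # 0.05, 0.10, ..., 0.45
a_best, b_best = grid_search(observations, grid)
print(L_uninformative, L_persistent, a_best, b_best)
```

When $b = 0.5$, every $Z_k$ equals $1/2$, so that $L = N \ln 2$ whatever the data. Parameters that describe persistent states with informative observations give a much smaller $L$ for this sequence.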

In the HMM literature, an alternate approach to finding $\theta^*$ is based on the
Expectation Maximization (EM), or Baum-Welch, algorithm [22, 20]. In a two-step
iteration, one finds $\theta$ by maximum likelihood assuming that the hidden states $x^N$ are
known and then infers states $x^N$ from the smoother algorithm assuming $\theta$ is known.
The algorithm converges locally but can be very slow. Indeed, the EM algorithm can
seldom compete against the more sophisticated direct-optimization algorithms readily
available in standard scientific programming languages. EM algorithms can, however, be
the starting point for recursive variants that allow for adaptation [33]. A third approach
to finding HMM parameters, based on finding the most likely (Viterbi) path, can also
converge faster than EM and be more robust [34].

7. Control of discrete-state-space systems 

We can now discuss the control of Markov models and HMMs. In the context of 
discrete state spaces, the control influences the transition probability, which becomes 
$P(x_{k+1} \mid x_k, u_k)$ and is described by a time-dependent transition matrix $A_k$ and a
graphical structure illustrated in figure 8. Note that our previous discussion of state
estimation (filtering) never assumed that the transition matrix is time independent. 


Figure 8. Partially observable Markov decision process graphical structure. The
hidden states $x_{k+1}$ form a Markov process whose transitions depend both on states $x_k$
and observations $y_k$. The two rows of the diagram are labeled “hidden” and “observed.”


The control of Markov chains is formally known as a Markov Decision Process 
(MDP), while that of HMMs is known as a Partially Observable Markov Decision 
Process (POMDP). Optimal-control protocols that minimize some cost function can be 
found using Bellman’s dynamic programming, which is a general algorithm for problems 
involving sequential decisions [19] [35] . In this setting, control is viewed as a blend of 
state estimation and decision theory [35] [36]. The goal is to choose actions based on 
available information in order to minimize a cost function. 

Here, we will present such ideas more informally, using a well-studied example: 
optimal work extraction from a two-state system with noisy observations and feedback. 
This problem is closely related to a famous thought experiment (recently realized 
experimentally [37]), Maxwell’s demon. 

7.1. Maxwell’s demon 

As discussed in the introduction, a Maxwell demon is a device where information about 
the state of a system is used to extract energy from a heat bath, in violation of the 
traditional form of the Second Law of thermodynamics. How is this possible? The catch
is that we have assumed that information carries no cost. A first attempt at resolving the
paradox hypothesized that energy is dissipated in acquiring information [1]. However,
that turns out not to be true in general: one can sometimes acquire information without
doing work. In its Kelvin-Planck formulation, the Second Law requires that no cyclic
protocol of parameter variation can extract work from the heat bath of an equilibrium
system held at constant temperature. Specifying a cyclic protocol can be subtle. Naively,
a cyclic protocol requires that any potentials that are changed must be returned to their
initial state; any mechanical parts (pistons, etc.) that are moved must be moved back;
and so on. But the requirement also applies to information. In particular, any information
acquired must be erased. In 1961, Landauer proposed that the erasure step necessarily requires
energy dissipation of at least $kT \ln 2$ per bit, an amount that equals or exceeds the
amount of work that can be extracted, thus saving (or extending) the Second Law [38].
Landauer’s prediction has recently been confirmed experimentally [39, 40], as has its
converse, the Szilard engine, which uses acquired information to extract work from a
heat bath [37, 41, 42].



Figure 9. Converting information to work in a two-state system that hops back and
forth between “left-well” and “right-well” states separated by a high energy barrier $E_b$.
If the system is observed to be in its right-well state, then we can raise the left well
without doing work. After a time $\tau$, the well is lowered. If the left state is occupied,
we extract an energy $E$ that can be used to perform work. Panels, left to right:
$t = 0$, $0 < t < \tau$, and $t = \tau$.


7.2. A simple model, with fully observed states 

We consider a particle in a fluid, subject to a double-well potential that may be 
manipulated by the experimenter (figure 9). It is a useful setting for thinking about
the issues raised by a Maxwell demon and is a situation that can now be realized
experimentally [39, 40]. We assume that the energy barrier is large ($E_b \gg kT$), so that
we can coarse grain to two-state Markov dynamics, as discussed in section 2. Henceforth,
we set $kT = 1$. At intervals $\tau$, we observe the state of the system and record which well
the particle is in. For now, we assume this measurement is never wrong. 

To extract work from a heat bath, we implement the following protocol: At $t = 0$,
the potential is symmetric, with no energy-level difference between left and right wells.
We then observe the particle. If we determine it to be in the right well, then, with no
significant time delay, we quickly raise the left well to an energy $E$ (and vice versa if in
the left well). Raising the left well costs no work if we change the potential only where
the particle is not present. From Sekimoto’s formulation of stochastic energetics, the
work done by an instantaneous change of potential is just $\Delta U$, the change of potential
evaluated at the position of the particle [43].

We then wait a time $\tau$, keeping fixed the energy $E$ of the left well. At some time,
the particle may spontaneously hop to the left well, because of thermal fluctuations. At
time $\tau$, the left well is quickly lowered back to $E = 0$. If the particle happens to be in
the left well, we extract an energy $E$ from the heat bath. If not, no energy is extracted.
Summarizing, the protocol is to measure the state; then raise the appropriate well by
$E$ and wait $\tau$; then lower the well back to 0.

Over many trials, the average extracted work $\langle W \rangle$ is given by $E p_\tau$, where $p_\tau$ is
the probability for the particle to be in the left well at time $\tau$. But $p_\tau$ also depends on
$E$. To evaluate the relation, we consider the continuous time dynamics of the state of
the system, allowing hops between states at arbitrary times $t$ but still considering the
hops themselves to be instantaneous. The discrete-time master equation $\mathbf{p}_{k+1} = A \mathbf{p}_k$
then becomes $\dot{\mathbf{p}} = \mathcal{A} \mathbf{p}$, where the matrix $\mathcal{A}$ has columns that sum to zero, to keep
$\mathbf{p}$ normalized at all times. Normalization implies that a two-state system has but one
independent evolution equation, for the left-well probability $p(t)$, which obeys


p = -uj-p + cj + (1 — p ), 


(17) 


where is the transition rate out from the left well and u + is the transition rate into 
the left well. In equilibrium, detailed balance requires that uj+/u = e~ E . Scaling time 
so that u>- = 1 then gives 


p = —p + e E (1 - p). 


(18) 


Setting $\dot{p} = 0$ gives the steady-state solution $p_\infty = 1/(e^{E} + 1)$. Notice that $E = 0$
implies $p_\infty = \tfrac{1}{2}$, as expected for a symmetric double-well potential, and that $E \to \infty$
implies that the particle is always in the right well ($p_\infty \to 0$). For finite times, we solve
(18) with $p_0 = 0$. The solution, $p_\tau = p_\infty \left[ 1 - e^{-(1 + e^{-E})\tau} \right]$, implies that


$$\langle W \rangle = \frac{E}{e^{E} + 1} \left[ 1 - e^{-(1 + e^{-E})\tau} \right]. \tag{19}$$


Note that we choose signs so that $\langle W \rangle > 0$ corresponds to work extraction.

Intuitively, for a given cycle time $\tau$, an optimal energy $E^*$ maximizes the average
work: if $E$ is too small, you will extract work in many cycles, but the amount each
time will be small. If $E$ is too large, you will extract more work, but only very rarely,
since the relative probability of being on the left side is at most $e^{-E}$. For the quasistatic limit
$\tau \gg 1$, $\langle W \rangle \approx E/(e^{E} + 1)$, whose maximum is $\langle W \rangle^* \approx 0.28$ for $E^* \approx 1.28$.
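
The optimum quoted above can be checked numerically. The following is a minimal sketch, assuming energies in units of $k_B T$ and time in units of $1/\omega_-$ as in (18). The function names and the golden-section search are illustrative choices, not part of the original analysis.

```python
import math

def p_tau(E, tau, p0=0.0):
    """Probability of the left well at time tau, from dp/dt = -p + exp(-E) (1 - p)."""
    omega = math.exp(-E)              # rate into the raised left well
    p_inf = omega / (1.0 + omega)     # steady state, equal to 1 / (exp(E) + 1)
    return p_inf - (p_inf - p0) * math.exp(-(1.0 + omega) * tau)

def mean_work(E, tau):
    """Average extracted work per cycle, equation (19), for a particle starting on the right."""
    return E * p_tau(E, tau)

def golden_max(f, lo, hi, tol=1e-10):
    """Golden-section search for the maximum of a unimodal function on [lo, hi]."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c, d = b - g * (b - a), a + g * (b - a)
        if f(c) > f(d):
            b = d                     # maximum lies in [a, d]
        else:
            a = c                     # maximum lies in [c, b]
    return 0.5 * (a + b)

# tau = 50 stands in for the quasistatic limit tau >> 1.
E_star = golden_max(lambda E: mean_work(E, tau=50.0), 0.0, 10.0)
W_star = mean_work(E_star, tau=50.0)
print(f"E* = {E_star:.3f}, <W>* = {W_star:.3f}")
```

Setting the derivative of $E/(e^{E}+1)$ to zero gives $E^* = 1 + e^{-E^*}$, so the numerical optimum should also satisfy $\langle W \rangle^* = E^* - 1$.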

The second law of thermodynamics implies that $\langle W \rangle \le \Delta F$, where the free energy
difference $\Delta F$ is just the difference in entropy $\Delta S$, since the internal energy difference
is zero for a cyclic process where the energies of both states are identical at beginning
and end. The maximum entropy difference is $\ln 2 \approx 0.69$, which is considerably larger
than the $\approx 0.28$ found in the quasistatic limit of our protocol.

To achieve the $\ln 2$ upper bound for extracted work per cycle, we need to allow
$E(t)$ to vary continuously in the interval $0 < t < \tau$ (and to have jump discontinuities
at the beginning and end of the interval). Such continuous-time protocols have been
considered previously and lead to protocols that extract $\ln 2$ of work in the quasistatic
limit [44, 45, 23]. Nonetheless, we prefer our constant-$E$ protocol:

• The mathematics is simpler. The continuous version uses calculus of variations. 
The discrete one requires only ordinary calculus. 

• If implemented experimentally, the protocols would almost certainly be carried out 
digitally, with an output that is fixed between updates. 

• When the goal is to optimize power extraction from the heat bath (rather than
work per cycle), the constant-$E$ and continuous protocols give identical results.

To explore this last point, we rewrite (19) for average power, $\mathcal{P} = \langle W \rangle / \tau$.
Assuming, as a more careful analysis confirms, that maximum average power extraction
occurs when $\tau \ll 1$, we have

$$\mathcal{P} \approx \frac{1}{\tau} \frac{E}{e^{E} + 1} \left[ (1 + e^{-E})\tau \right] = E e^{-E}, \tag{20}$$

which has a maximum $\mathcal{P}^* = 1/e \approx 0.37$ for $E^* = 1$. The same result is found for
the continuous protocol [23]. Since maximum energy extraction requires quasistatic,
infinitely slow manipulations, the power at maximum energy tends to zero. Maximizing
power extraction is arguably more interesting experimentally.
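
As a quick check of the short-cycle limit, the sketch below evaluates the power $\langle W \rangle/\tau$ from (19) at a small cycle time and scans $E$ on a grid. The grid, the value $\tau = 10^{-3}$, and the function names are assumptions made for illustration.

```python
import math

def mean_power(E, tau):
    """Average power <W>/tau for the constant-E protocol, from equation (19)."""
    rate = 1.0 + math.exp(-E)
    # -expm1(-x) equals 1 - exp(-x) and stays accurate for small x.
    work = E / (math.exp(E) + 1.0) * (-math.expm1(-rate * tau))
    return work / tau

def short_cycle_power(E):
    """Limit tau -> 0 of mean_power, equation (20)."""
    return E * math.exp(-E)

# Scan E on a grid for a short cycle time and locate the maximum.
tau = 1e-3
grid = [0.001 * i for i in range(1, 5001)]      # E from 0.001 to 5
E_best = max(grid, key=lambda E: mean_power(E, tau))
print(E_best, mean_power(E_best, tau), 1.0 / math.e)
```

For small but finite $\tau$ the maximum sits slightly below $1/e$, because the first correction to (20) is negative.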

7.3. Hidden states 

So far, we have assumed noise-free observations. If the observations are noisy, we have
to infer the probability $p(0) = p_0$ that the particle is in the left well. Assuming that the
particle is likely in the right well ($0 \le p_0 < \tfrac{1}{2}$), then we should raise the left well. After
a time $\tau$ has elapsed, (18) implies that

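$$p_\tau = p_\infty - (p_\infty - p_0)\, e^{-(1+\omega)\tau} = \frac{\omega (1 - \epsilon)}{1 + \omega} + \epsilon\, p_0, \tag{21}$$

with $\omega = e^{-E}$ and $\epsilon = e^{-(1+\omega)\tau}$, so that $p_\infty = \omega/(1+\omega)$. This expression is linear in $p_0$, as the master equation
(18) is linear. The discrete-time master equation for time step $\tau$ then is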

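$$\underbrace{\begin{pmatrix} \alpha & \beta \\ 1-\alpha & 1-\beta \end{pmatrix}}_{A_L} \begin{pmatrix} p_0 \\ 1 - p_0 \end{pmatrix} = \begin{pmatrix} \alpha p_0 + \beta (1 - p_0) \\ (1-\alpha) p_0 + (1-\beta)(1 - p_0) \end{pmatrix} = \begin{pmatrix} \beta + (\alpha - \beta) p_0 \\ (1 - \beta) - (\alpha - \beta) p_0 \end{pmatrix}. \tag{22}$$

Matching terms with (21) gives $\beta = \omega(1-\epsilon)/(1+\omega)$ and $\alpha = (\omega+\epsilon)/(1+\omega)$. The complements are
$1 - \beta = (1+\omega\epsilon)/(1+\omega)$ and $1 - \alpha = (1-\epsilon)/(1+\omega)$. Thus, when the left well is raised, the transition
matrix $A_L(E, \tau)$ is

$$A_L = \frac{1}{1+\omega} \begin{pmatrix} \omega + \epsilon & \omega(1-\epsilon) \\ 1 - \epsilon & 1 + \omega\epsilon \end{pmatrix}. \tag{23}$$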

Notice that the columns of $A_L$ sum to one, as they must, and that the Markov transition
matrix is no longer symmetric, as expected since we raise one of the wells. The novel
aspect for us is that the transition matrix $A_L$ now depends on the energy level $E$, which
can be set at each time step.

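When the right well is raised, matrix elements are switched, with left $\leftrightarrow$ right. This
amounts to swapping “across the diagonal” of the matrix. Thus,

$$A_R = \frac{1}{1+\omega} \begin{pmatrix} 1 + \omega\epsilon & 1 - \epsilon \\ \omega(1-\epsilon) & \omega + \epsilon \end{pmatrix}. \tag{24}$$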
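
The two transition matrices can be built and sanity-checked in a few lines. This is a sketch under the assumption that states are ordered (left, right) and that the matrix entries are $(\omega+\epsilon)$, $\omega(1-\epsilon)$, $(1-\epsilon)$ and $(1+\omega\epsilon)$, each divided by $1+\omega$, with $\omega = e^{-E}$ and $\epsilon = e^{-(1+\omega)\tau}$. The function name is hypothetical.

```python
import math

def transition_matrices(E, tau):
    """Return (A_L, A_R) as nested lists; state order is (left, right)."""
    w = math.exp(-E)                      # omega in the text
    eps = math.exp(-(1.0 + w) * tau)      # epsilon in the text
    n = 1.0 + w
    A_L = [[(w + eps) / n, w * (1.0 - eps) / n],
           [(1.0 - eps) / n, (1.0 + w * eps) / n]]
    # Raising the right well instead relabels left <-> right in rows and columns.
    A_R = [[A_L[1][1], A_L[1][0]],
           [A_L[0][1], A_L[0][0]]]
    return A_L, A_R

A_L, A_R = transition_matrices(E=1.0, tau=0.1)
for name, A in (("A_L", A_L), ("A_R", A_R)):
    col_sums = [A[0][j] + A[1][j] for j in range(2)]
    print(name, A, "column sums:", col_sums)
```

Each column should sum to one, and applying the first row of `A_L` to $(p_0, 1-p_0)$ should reproduce the relaxation $p_\tau = p_\infty - (p_\infty - p_0)\epsilon$.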


The previously analyzed case (19) for $p_0 = 0$ then represents the best-case scenario:
the particle is definitely on the right, and there is never a penalty for raising the left
well. For $0 < p_0 < \tfrac{1}{2}$, we will occasionally do work in raising the well when the particle
is present. Using (21) and maximizing over $E$, we can quickly calculate the maximum
work extraction as a function of $p_0$. Figure 10(a) shows that the maximum average
extracted work decreases as the initial state becomes more uncertain. When $p_0 = \tfrac{1}{2}$, we
have no information about the state of the system and cannot extract work from the
heat bath, in accordance with the usual version of the second law. For $p_0 > \tfrac{1}{2}$, we would
raise the right well; otherwise, we would be erasing information and heating the bath, rather
than extracting energy from it. Figure 10(b) shows that the work extracted is nearly a
linear function of the change in Shannon entropy between initial and final states. As in
Szilard’s analysis, information is used to extract work from the heat bath. Here, the
average slope (converted to nats) gives an efficiency of roughly 41%: less than half the
information gained is extracted as work by this particular protocol.
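
The dependence of the maximum work on $p_0$ can be sketched numerically. The code below assumes that the average extracted work is $E\,(p_\tau - p_0)$: lowering the well returns $E$ with probability $p_\tau$, while raising it costs $E$ with probability $p_0$. That expression is inferred from the description above rather than quoted from it, and the grid search and names are illustrative.

```python
import math

def mean_work(E, p0, tau):
    """Assumed average extracted work E * (p_tau - p0) when the left well is raised by E."""
    w = math.exp(-E)
    eps = math.exp(-(1.0 + w) * tau)
    p_t = w * (1.0 - eps) / (1.0 + w) + eps * p0      # relaxation from p0 toward p_inf
    return E * (p_t - p0)

def max_work(p0, tau, E_max=20.0, steps=20000):
    """Grid search over E in [0, E_max]; returns (best E, best work)."""
    best_E, best_W = 0.0, 0.0
    for i in range(steps + 1):
        E = E_max * i / steps
        W = mean_work(E, p0, tau)
        if W > best_W:
            best_E, best_W = E, W
    return best_E, best_W

# tau = 50 stands in for the quasistatic limit.
for p0 in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5):
    E_opt, W_opt = max_work(p0, tau=50.0)
    print(f"p0 = {p0:.1f}: E* = {E_opt:.2f}, <W>* = {W_opt:.3f}")
```

Under this assumption the maximum work falls monotonically with $p_0$ and vanishes at $p_0 = \tfrac{1}{2}$, where the best choice is $E = 0$.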


7.4. Two protocols 

We have not yet specified how to estimate $p_0$ at the beginning of each time interval.
We do so via the observations $y_k$ that are made at the beginning of each control period
$\tau$, before the choice of $E$. The observations have two symbols and are characterized by
an observation matrix of the form of (4), with $b$ the symbol error rate. We thus return
to the formalism discussed in section 5, where $p_0 \to P(x_k)$, the state of the system at
time $k$. Similarly, $p_\tau \to P(x_{k+1})$. The only difference is that we modify $A$ by choosing
$E$ and which well to raise at each time step. Call the choice $A_k$.

We can incorporate observations in two ways. One is to use only the observation
to estimate $P(x_k)$. Then Bayes’ theorem implies that $P(x_k | y_k) \propto P(y_k | x_k)$, where the
prior $P(x_k) = \tfrac{1}{2}$, since left $\to$ right and right $\to$ left state transitions are equally likely.
Although $P(x_{k+1} | x_k, u_k)$ does not satisfy this condition, the time-averaged sequence of
transition matrices does: since left and right levels are raised at equal frequencies, the
overall statistics are symmetric in the absence of other information. Here, $u_k$ is the
control variable, a function of $E$.

Figure 10. Maxwell demon extracts work in the quasistatic limit $\tau \gg 1$. (a) Average
work $\langle W \rangle$ vs. probability to be in the left well at time 0. (b) Average work vs. information gain.

The second way is to use the filtering formalism developed in section 5.1 to
recursively compute $P(x_k | y^k)$. (Without information about the future, we cannot use
smoothing.) We can say that the second strategy, which depends on past observations,
uses memory whereas the first uses no memory. The procedure is then to

• Measure $y_k$.

• Update $P(x_k | y^k)$, based on $\{A_k, B\}$, with the time-dependent transition matrix $A_k$
given by $P(x_{k+1} | x_k, u_k)$. The control $u_k$ is a function of $E_k$.

• Determine $E_{k+1}$ by maximizing $\langle W \rangle (E)$, the average work extracted in a cycle.

• Apply $u_{k+1}$.
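
One cycle of this procedure might be sketched as follows. The sketch assumes a symmetric binary observation with error rate $b$, assumes that the expected work for raising a well occupied with probability $p$ is $E\,(p_\tau - p)$, and picks $E$ by a simple grid search. The labels `"L"` and `"R"` and all function names are hypothetical.

```python
import math

def bayes_update(prior_left, y, b):
    """Posterior P(left | y) for a binary observation y in {'L', 'R'} with error rate b."""
    like_left = 1.0 - b if y == "L" else b
    like_right = b if y == "L" else 1.0 - b
    num = like_left * prior_left
    return num / (num + like_right * (1.0 - prior_left))

def best_energy(p_occ, tau, E_max=20.0, steps=4000):
    """Energy maximizing the assumed work E * (p_tau - p_occ) for the raised well."""
    def work(E):
        w = math.exp(-E)
        eps = math.exp(-(1.0 + w) * tau)
        return E * (w * (1.0 - eps) / (1.0 + w) + eps * p_occ - p_occ)
    return max((E_max * i / steps for i in range(steps + 1)), key=work)

def demon_step(prior_left, y, b, tau):
    """One cycle: measure, update, choose the control, and predict the next prior."""
    post_left = bayes_update(prior_left, y, b)
    raise_left = post_left < 0.5                  # raise the well less likely to be occupied
    p_occ = post_left if raise_left else 1.0 - post_left
    E = best_energy(p_occ, tau)
    w = math.exp(-E)
    eps = math.exp(-(1.0 + w) * tau)
    p_occ_next = w * (1.0 - eps) / (1.0 + w) + eps * p_occ   # occupation after time tau
    next_prior_left = p_occ_next if raise_left else 1.0 - p_occ_next
    return ("left" if raise_left else "right"), E, next_prior_left

# Memoryless variant: reset the prior to 1/2 before every call.
well, E, prior = demon_step(prior_left=0.5, y="R", b=0.1, tau=0.1)
print(well, round(E, 3), round(prior, 4))
```

Feeding the returned prior back into the next call gives the strategy with memory; always passing `prior_left=0.5` gives the strategy without memory.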

Iterated, the above algorithm leads to plots of the average extracted work as a
function of the measurement-error probability $b$ (figure 11). In (a), the curve labeled
“memory” uses the Bayesian filter to estimate the state of the system. By “memory,” we
mean that the inference about which energy level to alter is based on all the observations
$y^k$ up to time $k$. By contrast, the “no memory” curve uses only the current
observation, $y_k$. As before, the extra information from past states is most useful at
intermediate values of error rate $b$. The difference curve, plotted in (b), resembles
figure 7, which compared estimator entropies of the smoother and filter state estimates.
The conclusion, again, is that extra information is most useful at intermediate
signal-to-noise ratios. Here, retaining a memory of past observations via the filter allows the
Maxwell demon to extract more power from the heat bath.

Figure 11. Maxwell demon extracts power. (a) Comparison of power extracted using
past and present states $y^k$ to that using only the current state $y_k$. (b) Difference
between the two extracted powers. Cycle time $\tau = 0.1$.


7.5. Phase transition in a Maxwell demon 

The continuous-protocol version of the Maxwell demon shows phase transitions in the
behavior of the demon as the symbol error rate $b$ is varied [24]. To see that
similar phenomena arise in the constant-$E$ protocol discussed in this paper, compare
the outcomes of the strategy that uses memory ($y^k$) with one using no memory ($y_k$).
More precisely, we define a “discord” order parameter $\mathcal{D}$,

$$\mathcal{D} = 1 - \langle y\, \hat{x} \rangle, \tag{25}$$

where $y = \pm 1$ represents the time series of observations and $\hat{x} = \pm 1$ represents the
state estimate, based in this case on the optimal filter.§ If $y$ and $\hat{x}$ always agree, $\mathcal{D} = 0$.
If $y$ and $\hat{x}$ are uncorrelated, $\mathcal{D} = 1$. Partial positive correlations imply $0 < \mathcal{D} < 1$.
Put differently, $\mathcal{D} > 0$ implies that there is value in having a memory, as the filter
estimate $\hat{x}$ can differ from the observation. When $\mathcal{D} = 0$, the filter always agrees with
the observation, implying that there is no value in calculating the filter.
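
The discord order parameter is simply one minus the sample average of the product of observation and estimate. A minimal sketch, with a made-up pair of sequences for illustration:

```python
def discord(y, x_hat):
    """Discord D = 1 - <y * x_hat> for two equal-length sequences with entries +1 or -1."""
    if len(y) != len(x_hat) or not y:
        raise ValueError("sequences must be non-empty and of equal length")
    return 1.0 - sum(yi * xi for yi, xi in zip(y, x_hat)) / len(y)

y = [1, 1, -1, 1, -1, -1, 1, 1]
x_hat = [1, 1, 1, 1, -1, -1, 1, 1]      # the estimate overrides the third observation
print(discord(y, y), discord(y, x_hat))
```

In this example the estimate disagrees with one observation out of eight, so the average product is $6/8$ and the discord is $0.25$.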

In figure 12(a), we plot the discord order parameter $\mathcal{D}$ against the symbol error
rate $b$ for three different cycle times, $\tau = 0.1$, 1, and 10. There are many interesting
features. For long cycle times, represented by $\tau = 10$ and hollow markers, observations
match the inferred state (defined here to be the more likely state, as determined by
the probabilities from the filter algorithm). For intermediate cycle times, represented
by $\tau = 1$ and red markers, there is a continuous bifurcation, or second-order phase
transition, indicated by an up-pointing red arrow at $b = b_c \approx 0.258$. (The apparent
discontinuity results from the limited resolution of the plot. At higher resolution, not
shown, the bifurcations are clearly continuous.) For $b < b_c$, the filter estimate and
observation always agree. For $b > b_c$, they disagree sometimes. For short cycle times,
represented by $\tau = 0.1$ and black markers, we observe two transitions that, upon closer
inspection, are both discontinuous, corresponding to first-order phase transitions and
marked by down-pointing black arrows. Finally, at $b = 0.5$, the order parameter $\mathcal{D} = 1$,
since there is no correlation between observation and the internal state (or its estimate).
Interestingly, there is always a jump discontinuity in $\mathcal{D}$ at $b = 0.5$.

§ This order parameter has nothing to do with the quantum discord order parameter that is used to
distinguish between classical and quantum correlations [46].

Figure 12. Phase transition in discord order parameter. (a) Maxwell demon, for three
different cycle times $\tau$. Black down-pointing arrows mark jump discontinuities. Red
up-pointing arrow marks a continuous phase transition. (b) Similar plot for HMM, for
three values of transition matrix parameter $a$.

8. Phase transitions in state estimation 

The phase transition observed in the Maxwell-demon model given in the previous section
can also be seen in hidden Markov models that have nothing to do with thermodynamics.
Figure 12(b) shows the discord order parameter $\mathcal{D}$ for a two-state, two-symbol HMM
with $x, y \in \{-1, +1\}$, for three values of $a$. As in figure 12(a), there are first-order
transitions for small values of $a$, continuous transitions for intermediate values, and no
transitions for larger values. Intuitively, we need long dwell times in states (low values
of $a$) so that we have time to average over (filter) the observation noise. If so, we may
be confident in concluding the true state is different from the observed state. If the
dwell time is short (high value of $a$), the best strategy is to trust the observations. Note
that the values of $a$ correspond roughly to the same regimes as implied by the values of
$\tau$; however, we cannot make an exact mapping, since the Markov transition rate in the
Maxwell-demon model depends on the control $u_k$, which depends on observation errors $b$.

As with the Maxwell-demon example, for given $a$ there is a critical value of $b$,
denoted $b_c$. To calculate $b_c$, we note that there is an upper limit to the confidence one
can have in a given state estimate. As we can see in figure 5, this limit is achieved after
a long string of identical observations, say $y^k = 1$, that is, $\{y_1 = 1, y_2 = 1, \ldots, y_k = 1\}$.
See the string of eight +1 states in figure 5 as an example. More formally, we consider
$P(x_k = 1 | y^k = 1)$. For $k \gg 1$, the maximum value of the state probability approaches
a fixed point $p^*$. The intuition is that even with a long string of +1
observations, you cannot be sure that there has not just been a transition and an
accompanying observation error. We derive $p^*(a, b)$ in Appendix A and plot the results
in figure 13(a).



Figure 13. (a) Maximum confidence level $p^*$ as a function of symbol error probability
$b$ for Markov transition probability $a = 0.01$, 0.1, 0.2, 0.3, 0.4, 0.5. Dotted lines show the
($a = 0.2$, $b = 0.3$) case. (b) Critical value of symbol error probability, $b_c$, for filter
(solid markers) and smoother (hollow markers), vs. Markov transition probability $a$.
Simulations as in figure 12(b), with 1000 time units. For fixed $a$, the parameter $b$ is
incremented by 0.01 from 0 until $\mathcal{D} > 0.001$, which defines $b_c$. Solid lines are plots of
(27) and (28). No parameters have been fit.


Let us denote $\hat{x}^f_k = \arg\max_{x_k} P(x_k | y^k)$, the filter estimate of $x_k$. To find conditions
where $\hat{x}^f_k$ disagrees with $y_k$, we construct the extreme situation where a long string $y^k = 1$
gives the greatest possible confidence that $x_k = 1$. Then let $y_{k+1} = -1$. The filter estimate
and observation disagree only if the discordant observation fails to lower the confidence in
$x_{k+1} = 1$ to below $\tfrac{1}{2}$. Thus, the condition defining $b_c$ is

$$P(x_{k+1} = 1 | y_{k+1} = -1, y^k = 1) = \tfrac{1}{2}. \tag{26}$$

Writing this condition out explicitly gives, after a calculation detailed in Appendix A,

$$b_c^{\text{filter}} = \tfrac{1}{2} \left( 1 - \sqrt{1 - 4a} \right). \tag{27}$$

A similar calculation for the smoother, again detailed in Appendix A, leads to

$$b_c^{\text{smoother}} = \tfrac{1}{2} \left[ 1 - \frac{\sqrt{(1 + a)(1 - 3a)}}{1 - a} \right]. \tag{28}$$

Figure 13(b) shows that the thresholds of simulated data agree with (27) and (28).
Both filter and smoother estimates imply that there is a maximum value of $a$, call it $a_c$,
above which $\mathcal{D} = 0$ for all $b$. For the filter, $a_c = \tfrac{1}{4}$, while for the smoother, $a_c = \tfrac{1}{3}$. The
higher value of $a_c$ reflects the greater value of smoother vs. filter inferences.
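
The two critical lines can be checked against the threshold conditions they are meant to solve. The sketch below assumes the closed forms $b_c = \tfrac{1}{2}(1 - \sqrt{1-4a})$ for the filter and $b_c = \tfrac{1}{2}[1 - \sqrt{(1+a)(1-3a)}/(1-a)]$ for the smoother, together with the fixed-point confidence $p^*$ of Appendix A. It then evaluates the posterior probability that should equal $\tfrac{1}{2}$ at threshold. Function names are illustrative.

```python
import math

def p_star(a, b):
    """Fixed-point confidence after a long run of identical observations (Appendix A)."""
    root = math.sqrt(a * a + (1 - 2 * a) * (1 - 2 * b) ** 2)
    return (1 - 2 * b + a * (4 * b - 3) + root) / (2 * (1 - 2 * a) * (1 - 2 * b))

def b_c_filter(a):
    """Filter threshold; real only for a <= 1/4."""
    return 0.5 * (1 - math.sqrt(1 - 4 * a))

def b_c_smoother(a):
    """Smoother threshold; real only for a <= 1/3."""
    return 0.5 * (1 - math.sqrt((1 + a) * (1 - 3 * a)) / (1 - a))

def filter_posterior(a, b):
    """P(x_{k+1} = 1 | y_{k+1} = -1, y^k = 1) at the fixed point."""
    q = a + p_star(a, b) * (1 - 2 * a)           # predicted P(x_{k+1} = 1 | y^k = 1)
    return b * q / (b * q + (1 - b) * (1 - q))

def smoother_posterior(a, b):
    """P(x_k = 1 | y_k = -1, all other observations = 1) at the fixed point."""
    q = a + p_star(a, b) * (1 - 2 * a)
    return b * q * q / (b * q * q + (1 - b) * (1 - q) ** 2)

a = 0.2
print(b_c_filter(a), filter_posterior(a, b_c_filter(a)))
print(b_c_smoother(a), smoother_posterior(a, b_c_smoother(a)))
```

Both posteriors should come out at one half, which is the defining condition for the threshold. For $a = 0.2$ the smoother threshold is much lower than the filter threshold, so the smoother begins to overrule single observations at smaller error rates.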


8.1. Mapping to Ising models 


Although we have explained some features of figure 12, there is clearly more to
understand. For example, there are both continuous and discontinuous transitions, as
well as evidence for multiple transitions at fixed $a$. To begin to understand the reason
for multiple phase transitions, we note that the two-state, two-symbol HMM can be
mapped onto an Ising model [47, 48]. Let us change variables:

$$P(x_{k+1} | x_k) = \frac{e^{J x_{k+1} x_k}}{2 \cosh J}, \qquad P(y_k | x_k) = \frac{e^{h\, y_k x_k}}{2 \cosh h}. \tag{29}$$


We use these definitions to formulate a “Hamiltonian” $H = -\ln P(x^N, y^N)$ via

$$H = -J \sum_{k=1}^{N} x_k x_{k+1} - h \sum_{k=1}^{N} y_k x_k, \tag{30}$$

where we have dropped constant terms that are independent of $x_k$ and $y_k$. For $a < \tfrac{1}{2}$,
the interaction term $J > 0$ is ferromagnetic: neighboring “spins” tend to align. The
term $h$ corresponds to an external field coupling constant. The field $h y_k$ is of constant
strength and, for $b < \tfrac{1}{2}$, has a sign equal to the observation $y_k$. The picture is that
a local, quenched field of strength $h y_k$ tries to align its local spin along the direction
defined by $y_k$. Notice that $h = 0$ for $b = \tfrac{1}{2}$: spins are independent of $y_k$, and observations
and states decouple. A further change of variables (gauge transformation), $z_k = y_k x_k$
and $\tau_k = y_k y_{k+1}$, gives

$$H(\tau, z) = -J \sum_k \tau_k z_k z_{k+1} - h \sum_k z_k, \tag{31}$$

which is a random-bond Ising model in a uniform external field $h$ [49].
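
The change of variables can be verified on a short chain. The sketch below assumes that $a$ is the probability of a state flip and $b$ the probability of an observation error, so that inverting the exponential forms for the two conditional probabilities gives $J = \tfrac{1}{2}\ln[(1-a)/a]$ and $h = \tfrac{1}{2}\ln[(1-b)/b]$. These closed forms are derived here, not quoted. The sequences are arbitrary, and an open chain is used, so the bond sum has one term fewer than the field sum.

```python
import math

def couplings(a, b):
    """Couplings J, h implied by flip probability a and observation error rate b."""
    return 0.5 * math.log((1 - a) / a), 0.5 * math.log((1 - b) / b)

def hamiltonian_xy(x, y, J, h):
    """H = -J sum x_k x_{k+1} - h sum y_k x_k over an open chain."""
    bond = sum(x[k] * x[k + 1] for k in range(len(x) - 1))
    field = sum(yk * xk for yk, xk in zip(y, x))
    return -J * bond - h * field

def hamiltonian_gauge(x, y, J, h):
    """Same energy written with z_k = y_k x_k and tau_k = y_k y_{k+1}."""
    z = [yk * xk for yk, xk in zip(y, x)]
    t = [y[k] * y[k + 1] for k in range(len(y) - 1)]
    bond = sum(t[k] * z[k] * z[k + 1] for k in range(len(z) - 1))
    return -J * bond - h * sum(z)

J, h = couplings(a=0.2, b=0.3)
x = [1, 1, -1, -1, -1, 1, 1, 1]
y = [1, -1, -1, -1, 1, 1, 1, -1]
print(J, h, hamiltonian_xy(x, y, J, h), hamiltonian_gauge(x, y, J, h))
```

The two energies agree because $y_k^2 = 1$, so $\tau_k z_k z_{k+1} = x_k x_{k+1}$ term by term.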

Starting in the late 1970s, both random-bond and random-field one-dimensional
Ising chains were extensively studied as models of frustration in disordered systems
such as spin glasses. In particular, Derrida et al. showed that the ground state at zero
temperature has a countable infinity of transitions at $h = 2J/m$ for $m = 1, 2, \ldots, \infty$
[50]. Their transfer-matrix formalism is equivalent to the factorization of the partition
function $Z = \prod_k Z_k$ given in (15).

The lowest-order transition, $h = 2J$, corresponds to a case where the external field
at a site forces the local spin to align, because we are at zero temperature. In terms of 
the original HMM problem, the ground state corresponds to the most likely (Viterbi) 
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path discussed briefly in section 0 [IS]- While the Viterbi path differs from the filter 
estimate considered here, there may be a similar explanation for the multiple transitions 
apparent in figure 12.

9. Discussion 

The formalism of hidden Markov models, or HMMs, can both simplify and clarify 
the discussion of stochastic thermodynamics of feedback using noisy measurements. 
Expressed in terms of the control-theory notation developed here, state estimation based 
on HMM formalism is an effective way to incorporate the effects of noisy measurements. 
As an application, we simplified a previous analysis of a Maxwell demon that uses 
observations to rectify thermal fluctuations. We saw that a surprising phase transition 
in the “discord” between observation and inferred state is also present in simple HMM 
models. At least in this case, the primary source of complexity seems to lie in the
process of state estimation, rather than some feature of the thermodynamics. 

Our study of phase transitions in the discord parameter follows the methods of 
Bauer et al. [21]; however, the mathematics is considerably more complicated in that 
case. We note that while Bauer et al. do observe a series of transitions in their numerics, 
they have not seen evidence for jump discontinuities (private communication). Perhaps 
the differences are also associated with the continuous protocol for varying E. More 
investigation is warranted. 

Beyond simplifying specific calculations, the use of HMMs leads to other insights. 
For example, in figure 11, we saw that using a memory improves the performance of a
Maxwell demon that extracts power from a heat bath. The greatest improvement was
for intermediate values of the noise parameter $b$. Sivak and Thomson, studying a simple
model of biological sensing, reached a similar conclusion [51].

The results presented here suggest a somewhat broader view. Figure 7 shows a
similar result, where the smoother estimate outperforms the filter estimate. Here,
performance is measured by the Shannon entropy of the estimated probability
distribution. Again, we see that the best performance, relative to the case without memory, is at
intermediate noise levels. Indeed, a variety of similar results can be obtained from many
analogous quantities. For example, filter estimates based on continuous measurements 
with Gaussian noise also exceed those based on discrete observation measurements, with, 
again, a maximum at intermediate values of observation noise. 

The common feature in all these different examples is that we compute some 
measure of performance—work extraction, Shannon entropy, etc.—as a function of 
added information. This added information can be previous observations (“memory”), 
offline observations, extra measurement precision, multiple measurements, and so on. 
In all cases, the greatest improvement is at intermediate noise levels or, more
precisely, at intermediate levels of signal-to-noise ratio. Intuitively, the observation
makes sense: if information is perfect (zero noise), then more is superfluous. If
information is worthless (zero signal), then more is again not better. But in intermediate
cases, more information does help: extra information is most useful at moderate signal-to-noise ratios.

It would be interesting to try to formalize these ideas further by defining a kind of 
“information susceptibility” in terms of a derivative of power extraction, etc. with 
respect to added information. In this context, it is worth noting the study by Rivoire 
and Leibler, who show that the value of information can be quantified by different 
information theoretic quantities, such as directed and mutual information, when the 
analysis is causal or acausal [52].

Finally, we note that while we have been careful to discuss the smoother as an 
offline analysis tool whereby data is analyzed after the fact, there are more interesting 
possibilities. As stochastic thermodynamics is generalized to accommodate information 
flows, we should also consider the equivalent to open systems. For quantities such 
as energy, we are used to the idea that a subsystem need not conserve energy and 
that we must account for both energy dissipation and energy pumping. Analogously, 
for information, we should consider both dissipation and the consequences of added 
information. Because such information comes from “outside” the system under direct 
study, causality need not be respected. For example, consider the problem of controlling 
the temperature of a house. A causal control system will simply respond to temperature 
perturbations after they occur. If it gets cold, the heater turns on. On the other hand, 
we know in advance that at night it gets cold, and we know, with effectively absolute 
certainty, the time the sun will set. Thus, we can anticipate the arrival of a cold 
perturbation and start to compensate for its effects before they occur. The resulting 
performance gain will be precisely analogous to the results shown in figure 7, where
we compare filter and smoother estimates. (The quality of state estimates limits the 
quality of control.) 

The analysis of noisy discrete dynamics of HMMs is perhaps the simplest non-trivial 
setting where these ideas may be explored. More generally, outside influences will appear 
as additional inputs to a state node in a graphical representation. In this context, 
the Bayesian treatment of causality due to Pearl shows how to generalize inferences 
such as filtering and smoothing to Bayesian networks, which have a richer graphical 
structure than the chain-like Markov models and HMMs sketched in figures 2, 4, and 8 [53, 36].
Such techniques have been used in stochastic thermodynamics to study information 
thermodynamics on networks [54] and would seem to be the right approach to studying 
systems that are “causally open.” 

In conclusion, we have introduced some of the properties of hidden Markov models 
that make them useful for simplifying the analysis of stochastic thermodynamics in 
the presence of feedback and noisy measurements, and we have seen how they suggest 
interesting areas for future research. 
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Appendix A. Calculation of phase transition critical line 

In the $a$-$b$ parameter plane, the critical line $b_c(a)$ defines the border between the $\mathcal{D} = 0$
and $\mathcal{D} > 0$ phases. Informally, the line separates a region where there is no benefit to
using the filter estimate from one where there is. We can use both filter and smoother
state estimates to calculate $\mathcal{D}$, giving two different critical lines.


Appendix A.1. Filter case

For the filter case, we first calculate the maximum confidence $p^*$. From (7),

$$\underbrace{P(x_k = 1 | y^k = 1)}_{p^*} = \frac{1}{Z_k} \underbrace{P(y_k = 1 | x_k = 1)}_{1 - b} \sum_{x_{k-1}} P(x_k = 1 | x_{k-1})\, P(x_{k-1} | y^{k-1} = 1). \tag{A.1}$$

Substituting for the matrix elements in (A.1), evaluating the normalization constant
(8), and imposing the fixed point gives a quadratic equation for $p^*$:

$$p^* = \frac{(1 - b) \left[ (1 - a) p^* + a (1 - p^*) \right]}{(1 - b) \left[ (1 - a) p^* + a (1 - p^*) \right] + b \left[ (1 - a)(1 - p^*) + a p^* \right]}, \tag{A.2}$$

whose solution is

$$p^* = \frac{1 - 2b + a(4b - 3) + \sqrt{a^2 + (1 - 2a)(1 - 2b)^2}}{2 (1 - 2a)(1 - 2b)}. \tag{A.3}$$

For example, $a = 0.2$ and $b = 0.3$ gives $p^* \approx 0.852$, which matches the upper bound
in figure 5. See also figure 13(a) in the main text.
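
The fixed point can also be reached by brute force, iterating the filter over a long run of identical observations. The sketch below assumes the symmetric two-state model (flip probability $a$, observation error rate $b$) and compares the iterated value with the closed-form expression for $p^*$. Function names are illustrative.

```python
import math

def filter_step(p_prev, y, a, b):
    """One step of the Bayesian filter; p_prev = P(x_{k-1} = +1 | y^{k-1})."""
    predict = (1 - a) * p_prev + a * (1 - p_prev)      # P(x_k = +1 | y^{k-1})
    like_plus = 1 - b if y == 1 else b                 # P(y_k | x_k = +1)
    like_minus = b if y == 1 else 1 - b                # P(y_k | x_k = -1)
    num = like_plus * predict
    return num / (num + like_minus * (1 - predict))

def p_star_closed_form(a, b):
    """Closed-form fixed point of the filter for a run of +1 observations."""
    root = math.sqrt(a * a + (1 - 2 * a) * (1 - 2 * b) ** 2)
    return (1 - 2 * b + a * (4 * b - 3) + root) / (2 * (1 - 2 * a) * (1 - 2 * b))

a, b = 0.2, 0.3
p = 0.5                      # uninformative initial belief
for _ in range(200):         # long run of +1 observations
    p = filter_step(p, 1, a, b)
print(p, p_star_closed_form(a, b))
```

Both numbers should agree with the value $p^* \approx 0.852$ quoted for $a = 0.2$ and $b = 0.3$.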

In terms of $p^*$, the condition for the threshold $b_c$ is given by

$$\begin{aligned}
P(x_{k+1} = 1 | y_{k+1} = -1, y^k = 1)
&= \frac{P(y_{k+1} = -1 | x_{k+1} = 1, y^k = 1)\, P(x_{k+1} = 1 | y^k = 1)}{P(y_{k+1} = -1 | y^k = 1)} \\
&= \frac{P(y_{k+1} = -1 | x_{k+1} = 1)\, P(x_{k+1} = 1 | y^k = 1)}{\sum_{x_{k+1}} P(y_{k+1} = -1 | x_{k+1})\, P(x_{k+1} | y^k = 1)} \\
&= \frac{b \left[ (1 - a) p^* + a (1 - p^*) \right]}{b \left[ (1 - a) p^* + a (1 - p^*) \right] + (1 - b) \left[ a p^* + (1 - a)(1 - p^*) \right]} \\
&= \frac{1}{2},
\end{aligned} \tag{A.4}$$

where the second line uses the fact that, given $x_{k+1}$, the observation $y_{k+1}$ does not depend on $y^k$.
Substituting (A.3) for $p^*$ gives

$$\frac{b \left( 1 - a - 2b + \sqrt{a^2 + (1 - 2a)(1 - 2b)^2} \right)}{(1 - 2b) \left( 1 + a - \sqrt{a^2 + (1 - 2a)(1 - 2b)^2} \right)} = \frac{1}{2}. \tag{A.5}$$

Rearranging and squaring leads to a remarkable simplification,

$$(1 - 2b)(b^2 - b + a) = 0, \tag{A.6}$$

which has solutions $b = \tfrac{1}{2}$ and $b = \tfrac{1}{2}(1 \pm \sqrt{1 - 4a})$. The relevant solution for the phase
transition has $b < \tfrac{1}{2}$, which corresponds to the negative root and (27).

Appendix A.2. Smoother case

For the smoother, the analogous threshold condition is given by

$$P(x_k = 1 | y_k = -1, y^{N \setminus k} = 1) = \tfrac{1}{2}, \tag{A.7}$$

where $y^{N \setminus k} = \{y_1, y_2, \ldots, y_{k-1}, y_{k+1}, \ldots, y_N\} = \{y^{k-1}, y_{k+1}^N\}$, i.e., all the observations
except $y_k$. For the smoother, the future observations are also +1. In words: if an
observation contradicts both past and future, do we trust it? We write

$$P(x_k = 1 | y_k = -1, y^{N \setminus k} = 1) = \frac{1}{Z} P(y_k = -1 | x_k = 1)\, P(x_k = 1 | y^{N \setminus k} = 1). \tag{A.8}$$

We then focus on the second term,

$$\begin{aligned}
P(x_k = 1 | y^{N \setminus k} = 1) &= P(x_k = 1 | y^{k-1} = 1, y_{k+1}^N = 1) \\
&= \frac{1}{Z} P(y_{k+1}^N = 1 | x_k = 1, y^{k-1} = 1)\, P(x_k = 1 | y^{k-1} = 1) \\
&= \frac{1}{Z} P(x_k = 1 | y_{k+1}^N = 1) \left( \frac{P(y_{k+1}^N = 1)}{P(x_k = 1)} \right) P(x_k = 1 | y^{k-1} = 1) \\
&= \frac{1}{Z} P(x_k = 1 | y_{k+1}^N = 1)\, P(x_k = 1 | y^{k-1} = 1) \\
&= \frac{1}{Z} \left[ P(x_k = 1 | y^{k-1} = 1) \right]^2,
\end{aligned} \tag{A.9}$$

where the conditioning on $y^{k-1}$ in the second line drops out because, given $x_k$, future observations
do not depend on past ones, and where we absorb $P(y_{k+1}^N = 1)$ and $P(x_k)$ into $Z$ and set $P(x_k | y_{k+1}^N) = P(x_k | y^{k-1})$.
The justification of this last step is that the sole difference in the two conditional
probabilities is $P(x_{k+1} | x_k) \to P(x_k | x_{k+1})$. But these are equal, as Bayes’ theorem
(or detailed balance) shows:

$$P(x_{k+1} | x_k) = P(x_k | x_{k+1})\, P(x_{k+1}) / P(x_k) = P(x_k | x_{k+1}), \tag{A.10}$$

where the unconditional probabilities $P(x_k) = P(x_{k+1}) = \tfrac{1}{2}$. The threshold condition (A.7) then becomes

$$\frac{1}{Z} P(y_k = -1 | x_k = 1) \left[ P(x_k = 1 | y^{k-1} = 1) \right]^2 = \tfrac{1}{2}. \tag{A.11}$$

Using our earlier results for the filter, (A.4), and with $p^*$ given by (A.3), we have

$$b\, \frac{(a + p^* - 2ap^*)^2}{(a + p^* - 2ap^*)^2 + (1 - a - p^* + 2ap^*)^2} = (1 - b)\, \frac{(1 - a - p^* + 2ap^*)^2}{(a + p^* - 2ap^*)^2 + (1 - a - p^* + 2ap^*)^2}.$$

Again, an amazing simplification leads to (28). That there are such simple solutions to
such complicated equations suggests that a more direct derivation might be found.
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